Creators/Authors contains: "Kilic, Ozgur Ozan"

  1. Recent advances in virtualization technologies used in cloud computing offer performance that closely approaches bare-metal levels. Combined with specialized instance types and high-speed networking services for cluster computing, cloud platforms have become a compelling option for high-performance computing (HPC). However, most current batch job schedulers in HPC systems are designed for homogeneous clusters and make decisions based on limited information about jobs and system status. Scientists typically submit computational jobs to these schedulers with a requested runtime that is often over- or under-estimated. More accurate runtime predictions can help schedulers make better decisions and reduce job turnaround times. They can also support decisions about migrating jobs to the cloud to avoid long queue wait times in HPC systems. In this study, we design neural network models to predict the runtime and resource utilization of jobs on integrated cloud and HPC systems. We developed two monitoring strategies to collect job and system resource utilization data using a workload management system and a cloud monitoring service. We evaluated our models on two Department of Energy (DOE) HPC systems and Amazon Web Services (AWS). Our results show that we can predict the runtime of a job with 31–41% mean absolute percentage error (MAPE), 14–17 seconds mean absolute error (MAE), and a 0.99 R-squared (R²) score. An MAE of less than a minute corresponds to 100% accuracy, since the requested time for batch jobs is always specified in hours and/or minutes. (A short sketch of these error metrics follows this entry.)
    Free, publicly-accessible full text available March 1, 2027
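    Below is a minimal sketch, not taken from the paper, of how the reported error metrics (MAPE, MAE, R²) can be computed from actual and predicted runtimes in Python with NumPy; the example values and the one-minute rounding used to relate MAE to requested-time accuracy are illustrative assumptions.

        import numpy as np

        # Hypothetical job runtimes in seconds; not data from the paper.
        actual    = np.array([3600.0, 7200.0, 1800.0, 5400.0])
        predicted = np.array([3585.0, 7230.0, 1740.0, 5460.0])

        # Mean absolute error (seconds) and mean absolute percentage error.
        mae  = np.mean(np.abs(predicted - actual))
        mape = np.mean(np.abs((predicted - actual) / actual)) * 100.0

        # R-squared: 1 minus residual sum of squares over total sum of squares.
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - np.mean(actual)) ** 2)
        r2 = 1.0 - ss_res / ss_tot

        # Requested walltime is specified in whole minutes, so an MAE below
        # 60 seconds does not change the minute-granularity request.
        requested_minutes = np.ceil(predicted / 60.0)

        print(f"MAE={mae:.1f}s  MAPE={mape:.1f}%  R2={r2:.4f}")
        print(f"requested walltime (minutes): {requested_minutes}")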
  2. Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution such as Google Pathways, with a distributed execution environment for deep learning models, exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including clouds and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework that executes Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead, with 4–15% faster execution time than batch execution while performing similar join and sort operations on 35 million and 3.5 billion rows with the same resources. The approach aims to excel on both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community. (A sketch of the underlying pilot-task submission pattern follows this entry.)
    Full text not available
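    Below is a minimal sketch of the pilot-job pattern that Radical-Cylon builds on, assuming the Python API of radical.pilot 1.x; the resource label, core count, script name, and rank count are placeholder assumptions rather than details from the paper.

        import radical.pilot as rp

        # Start a RADICAL-Pilot session with its pilot and task managers.
        session = rp.Session()
        pmgr    = rp.PilotManager(session=session)
        tmgr    = rp.TaskManager(session=session)

        # Acquire a pilot on a target resource (placeholder label and sizes).
        pilot = pmgr.submit_pilots(rp.PilotDescription({
            'resource': 'local.localhost',  # assumption: replace with an HPC or cloud resource label
            'cores'   : 8,
            'runtime' : 30,                 # minutes
        }))
        tmgr.add_pilots(pilot)

        # Describe a task that would run a (hypothetical) Cylon join/sort script.
        task = tmgr.submit_tasks(rp.TaskDescription({
            'executable': 'python3',
            'arguments' : ['cylon_join_sort.py'],  # placeholder script name
            'ranks'     : 4,                       # assumption: MPI rank count field in recent releases
        }))

        tmgr.wait_tasks()
        session.close(download=True)

    In the system described by the abstract, the Cylon operators themselves run inside such pilot-managed tasks, so the pilot layer handles placement across heterogeneous nodes while Cylon performs the distributed join and sort over MPI.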